Abstract

The paper gives an overview of learner corpora and their application to second language learning and teaching.
Granger (2009) proposes four core components of learner corpus research, namely, corpus linguistics expertise, a
good background in linguistic theory, knowledge of SLA theory, and a good understanding of foreign language
teaching issues. Based on these components, the present paper first introduces learner corpora, then reviews
the literature on the application of corpus linguistics to SLA by means of contrastive interlanguage analysis,
and finally discusses the relationship between learner corpora and foreign language teaching.
1. Introduction

With the development of learner corpora and multilingual corpora since the 1990s, there has been a revival of
corpus linguistics, especially in its application to second language research through the use of learner
corpora (see, for instance, Keck, 2004; Myles, 2005; Pravec, 2002). Investigating actual language use in
learner corpora makes it easier for researchers “to understand how best to help students develop competence in
the kinds of language they will encounter on a regular basis” (Biber & Reppen, 1998: 157).



Figure 1. Core components of learner corpus research 
(Adopted from Granger, 2009: 15) 


Granger (2009) proposed that there are four core components in learner corpus research, namely, corpus
linguistics expertise, a good background in linguistic theory, knowledge of SLA theory, and a good
understanding of foreign language teaching issues, as shown in Figure 1. The present paper aims to
introduce learner corpora and their applications to second language learning and teaching. Based on the above
components, we will look at 1) learner corpora themselves, 2) the application of
corpus linguistics to SLA by means of contrastive interlanguage analysis, and 3) the relationship between learner
corpora and foreign language teaching.

2. Literature Review 

2.1 Introduction of Learner Corpora 

Granger (2002: 7) provides a detailed definition of learner corpora: 

Computer learner corpora are electronic collections of authentic FL/SL textual data assembled according to 
explicit design criteria for a particular SLA/FLT purpose. They are encoded in a standardized and homogeneous 
way and documented as to their origin and provenance. 

There are a few keywords in this definition that are worth mentioning: “authentic”, “textual data” and “explicit
design criteria”.

First of all, in terms of authenticity, it is almost impossible for learner data to be completely natural, because
foreign language teaching activities inevitably involve some degree of “artificiality” (Granger,
2002: 8). As long as essay writing is conducted under authentic classroom circumstances, learner corpora of
essay writing can be regarded as authentic written data. Secondly, learner corpora should contain textual data
consisting of continuous “stretches of discourse”, rather than, for example, lists of disconnected erroneous
sentences. Finally, special attention must be paid to the criteria on which the learner corpus is built. In
addition to the compilation rules that apply to native corpora, factors such as the characteristics of the
learners and the task settings should also be taken into consideration.

A number of learner corpora have been released around the world and used for research. Examples include the
International Corpus of Learner English (ICLE), which contains argumentative essays written by higher
intermediate to advanced learners of English from various mother tongue backgrounds; the Louvain International
Database of Spoken English Interlanguage (LINDSEI), a spoken counterpart to ICLE containing oral data
produced by advanced learners of English from several mother tongue backgrounds; the Chinese Learner English
Corpus (CLEC), a collection of English essays written by Chinese students ranging from senior middle school to
university level; and the Cambridge Learner Corpus, which consists of exam scripts written by students taking
Cambridge ESOL exams around the world.

2.2 Contrastive Interlanguage Analysis 

Granger (1998) argues that the analysis of learner corpora is of great significance for second language
acquisition studies. She maintains that learners’ performance can be analyzed in corpora to infer the invisible
mental processes of SLA, and that hypotheses previously generated by the psycholinguistic approach can be tested
through analysis of learner corpora.

When corpus linguistics is applied to SLA, one method is usually adopted, namely Contrastive
Interlanguage Analysis (CIA). CIA, in both quantitative and qualitative terms, involves two different types of
comparison: one between native language and learner language (L1 vs. L2), and the other between different
varieties of interlanguage (L2 vs. L2) (Granger, 2009: 18).
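
The quantitative side of an L1 vs. L2 comparison can be illustrated with a minimal sketch. It is not the
procedure of any study cited here: the tokenizer, the function names, and the two one-sentence "corpora" are
hypothetical, and the sketch simply assumes that usage is compared through frequencies normalized per 1,000 tokens.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def per_thousand(word, tokens):
    """Relative frequency of `word` per 1,000 tokens; 0.0 for an empty corpus."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


def compare(word, learner_text, native_text):
    """Compare normalized frequencies of a word form in learner (L2) and native (L1) text.

    Only exact word forms are counted: no lemmatization is applied, so "made"
    does not count as an occurrence of "make".
    """
    learner = per_thousand(word, tokenize(learner_text))
    native = per_thousand(word, tokenize(native_text))
    if learner > native:
        label = "overuse"
    elif learner < native:
        label = "underuse"
    else:
        label = "no difference"
    return learner, native, label


# Invented toy data, for illustration only.
learner_sample = "I make a mistake and make a decision"
native_sample = "They reached a decision and made progress"
print(compare("make", learner_sample, native_sample))
```

In practice, a difference in normalized frequency would normally be checked with a significance test and followed
by qualitative inspection of the contexts in which the item occurs.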

In spite of the fact that controversies still exist with regard to L1 vs. L2 comparison, it is unreasonable to ignore 
its significance for describing the features of non-nativeness in learner writing and speech. A series of learner 
corpus studies have been conducted by taking a CIA approach, such as Altenberg and Granger (2001), Housen 
(2002), Nesselhauf (2005), Adel (2006), Xiao (2007), etc., which have revealed a number of interlanguage 
features in various linguistic environments. In their investigation of the high-frequency verb MAKE, for
instance, Altenberg and Granger (2001) found that learners even at the advanced level are
still “at a risk of having a very crude knowledge of their grammatical and lexical patterning” without ruling out
the “skeleton” entries held for high-frequency verbs (ibid.: 190).

One of the major limitations of traditional error analysis is that it “fails to provide a complete picture of learner
language”, and researchers “need to know what learners do correctly as well as what they do wrongly” (Ellis,
1994: 67). Schachter and Celce-Murcia (1977) also suggested that investigators should treat causes of
error very cautiously, for in many cases, “what we see happening, however, is just the reverse”. Current
interlanguage research differs from traditional error analysis in that the CIA approach treats learner
performance data in its own right rather than merely as a set of decontextualized errors (Granger, 1998).
Comparing second language learners’ interlanguage with the target native language is therefore more likely to
yield rewarding results. In my study, I will not concentrate on errors that learners have made, but focus on
systematic analysis of the ditransitive constructions that they use.

3. Learner Corpora and Foreign Language Teaching (FLT) 

Foreign language teaching has benefited from learner corpus research, and there is already general
agreement that corpus data, especially learner corpus data, “opens up interesting descriptive and pedagogical
perspectives” with “a profound and positive impact on the field of FLT” (Granger, 2002: 21). The two areas
that have gained most from corpus-based research are materials design and classroom teaching methodology.
The literature reviewed in this section is not limited to learner corpora, but also covers the application of
general native corpora to FLT.

3.1 Teaching Materials Design 

Rapid progress has been made in developing materials such as EFL dictionaries, grammar references, and
textbooks with the help of large-scale corpora, although the influence on dictionaries is more significant than on
the other two areas.

Compilation of Dictionaries 

In recent years, learners’ dictionaries of English have increasingly been compiled with reference to
up-to-date language databases. By taking into consideration frequency information from large-scale native
corpora, a number of English dictionaries for advanced L2 learners have been compiled. Dictionaries of this type
include the Oxford Advanced Learner’s Dictionary, the Collins Cobuild Dictionary, and the Longman Dictionary of
Contemporary English (Leech, 2001: 329). These dictionaries can provide detailed information about the
ranking of meanings, collocations, grammatical patterns, style and frequency (Granger, 2002: 21).

Gillard and Gadsby (1998) compiled The Longman Essential Activator, a dictionary that draws on The Longman
Learners’ Corpus (LLC), with the aim of helping L2 learners of English to produce a wider range of words and
phrases accurately and naturally, rather than relying heavily on a limited number of common words. The authors
generated frequency lists from LLC, which were used to help compilers decide what should be
included in the dictionary. In this dictionary, they gave very detailed information about each word, accompanied by
near-synonyms, under about 1000 ‘concepts’. For instance, under the ‘concept’ of WALK, words like stroll,
stride, amble, and jog are listed together with definitions and examples, to make it easier for
learners to distinguish these words. In addition, based on frequently occurring errors common to all learners,
they made use of ‘help boxes’ to remind learners not to make similar errors in their use of English. Gillard and
Gadsby (1998: 163) believe that “by having constant access to a very large body of students’ writing,
lexicographers are sensitized to and reminded of the needs of their audience far more thoroughly than they could
achieve through their previous teaching experience”. Their practice offers dictionary compilers much insight
into how the features of learner English can be taken into account.
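
The idea of a corpus-derived frequency list can be illustrated with a minimal sketch. It does not reproduce the
procedure of Gillard and Gadsby (1998) or the contents of the LLC: the function name, the tokenizer, and the
sample sentences are hypothetical, and the sketch assumes that the corpus is available as plain-text essays.

```python
import re
from collections import Counter


def frequency_list(essays, top=10):
    """Return the `top` most frequent word forms across a list of plain-text essays.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter()
    for essay in essays:
        counts.update(re.findall(r"[a-z']+", essay.lower()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Invented sample sentences, for illustration only.
sample_essays = ["I think it is good", "I think so", "it is nice"]
for word, count in frequency_list(sample_essays, top=3):
    print(word, count)
```

A list of this kind only shows which forms learners rely on most heavily; deciding which near-synonyms to offer
as alternatives remains an editorial judgment.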

Enhancement of Textbooks

Before the advent of learner corpora, decisions about what should be taught to students were based mainly on
English language teachers’ experience and intuition. It was therefore quite difficult for
compilers to check whether the teaching materials could meet learners’ needs (Guo, 2006: 233). In the past two
decades, a series of corpus studies have been conducted to test the effectiveness of the materials used in foreign
language teaching, including Grabowski and Mindt (1995) on irregular verbs, Barlow (1996) on reflexives,
Mindt (1997) on future time expressions, Conrad (2004) on linking adverbials, and Romer (2004) on modal verbs.
These studies have provided abundant evidence that “the language presented in textbooks is frequently
still based on intuitions about how we use language, rather than actual evidence of use” (O'Keeffe, McCarthy, &
Carter, 2007: 21). There is, therefore, a great need for “revised pedagogical language descriptions that take
corpus findings into account and present a more adequate picture of language as it is actually used” (Romer,
2010: 22).

Romer (2005) made a comparison between the use of progressives in native spoken English corpora (British 
National Corpus and Bank of English) and in representations of spoken English used in German EFL textbooks. 
It was found that 30%-40% of progressives are used to indicate repeated actions or events in native corpora, 
while repeatedness is seldom expressed by progressives in textbooks, where more than 90% of the progressives 
refer to single continuous events, as in What are you doing? or What have you been doing? 

Revised, corpus-informed descriptions may “address the described imbalance of functions and contexts in which
progressives are used in real conversations and textbooks, use authentic instead of invented examples, and focus
on frequent instead of rarely attested patterns” (Romer, 2010: 24).

An early attempt at applying corpus linguistics to course books was the Collins COBUILD English Course (CCEC),
designed by Willis and Willis (1989). It was a ‘lexical syllabus’ focusing on “the commonest words and phrases
in English and their meanings” (Willis, 1990: 124). Another pioneering and promising work was the Touchstone
series published by Cambridge University Press (McCarthy, McCarten, & Sandiford, 2005). This series of
corpus-based EFL textbooks has incorporated research findings from the Cambridge International Corpus,
and “present(ed) the vocabulary, grammar, and functions students need for effective conversations”.

Based on an investigation of ICLE, Kaszubski (1998) recommended that the traditional writing
textbooks used in Poland provide the following specific information (ibid.: 183):

a. longer lists of synonymous items, accompanied with frequency band information, register/style description, 
and (gradable) overuse/underuse/misuse warnings (if applicable). In cases of misuse, Polish and NS contrasting 
samples could be given; 

b. [...] lists of common collocations, with additional information on contrasts between Polish and NS use; 

c. listings of commonly misused words and phrases as well as examples of serious over- and underuse. 

These suggestions are not only applicable to textbook writers in Poland, but also useful for textbook writers from 
other countries. 

3.2 Classroom Teaching Methodology

As for teaching methodology, data-driven learning (DDL) has been highly recommended by many
researchers (e.g. Cobb, 1997; Johns, 2002; Johns & King, 1991). It mainly refers to “the use in the classroom of
computer-generated concordances to get students to explore regularities of patterning in the target language and
the development of activities and exercises” (Johns & King, 1991: iii). It is an inductive approach relying on an
“ability to see patterning in the target language and to form generalisations” about language form and use (Johns,
1991: 2). DDL is characterized by activities that place great emphasis on lexis and lexico-grammar, and by
the idea that learners should be exposed to as much authentic native speaker data as possible. Johns (2002: 108)
sees DDL as a process which “confronts the learner as directly as possible with the data” “to make the learner a
linguistic researcher”.
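
The kind of computer-generated concordance used in DDL can be illustrated with a minimal keyword-in-context
(KWIC) sketch. It is not the software used in any of the studies cited: the function name, the context width,
and the sample text are hypothetical, and the sketch assumes a plain-text corpus and a simple word tokenizer.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for every occurrence of `keyword`.

    `width` is the number of tokens shown on each side; matching ignores case.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text, for illustration only.
sample = "She will make a decision. They make progress every day."
for left, hit, right in kwic(sample, "make", width=2):
    # Align the keyword in a central column, as in a printed concordance.
    print(f"{left:>20}  {hit}  {right}")
```

Aligning the keyword in a central column is what allows learners to scan the contexts on either side and notice
recurring patterns for themselves.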

Other advantages of DDL are that learners may have access both to the errors they have made and to what is
correct and valid, and that DDL activities reinforce negotiation, interactivity and interaction (Meunier, 2002:
134). Cobb (1997) conducted a longitudinal study of vocabulary acquisition by means of concordance line tasks
drawn from a specially designed corpus, and showed a positive effect of DDL activities on L2 learning.

However, the method has also been called into question. For example, it is time-consuming to design
DDL activities; it requires a considerable amount of preparation on the part of teachers; various types of
strategies may cause confusion or problems for students; and many researchers doubt the value of DDL for
low-proficiency learners.

In spite of these different perspectives on DDL, it is commonly believed that DDL can be used wisely to facilitate
language teaching in the classroom by raising language awareness (Hawkins, 1984) and promoting self-discovery.
Among various methodologies, concordance-based exercises have proved to be an effective complement to
traditional teaching strategies (Granger, 2002, 2009; Meunier, 2002).

In terms of concordance-based exercises, not only native data concordancing, but also comparison between 
learner and native speaker data can be useful methods. As Nesselhauf (2004: 140) suggests, one of the 
advantages of using such comparison is “that asking learners to look for mistakes, or rather for differences in 
learner and native speaker language, can increase learner autonomy and train the learners’ general ability to 
notice such differences. In addition, such a procedure might also lead to a more positive attitude towards 
mistakes, because mistakes are then no longer merely a feature that has to be corrected, but also a feature that 
can be discovered”. Nesselhauf also calls for more empirical studies to investigate such issues as “for which areas,
for which learners and with what procedures data-driven learning with learner corpora is most efficient” (ibid.: 
144). 

4. Discussion and Conclusion 

The paper has given an overview of learner corpora and their application to second language learning and 
teaching. It can be seen that learner corpora can play an important role in second language learning research, and 
be of great use to teaching materials design and classroom teaching. 

Exercises from the data-driven learning approach give learners access to authentic language samples
accompanied by rich contexts. Learners can explore the use of words and phrases under the
guidance of teachers, and become increasingly aware of native language use through better noticing. Considering
the wide variation in aptitude, motivation, cognitive style, and other factors among different learners,
teachers should treat DDL exercises with due caution.

As Granger and Tribble (1998: 209) suggest, “concordances need to be carefully edited to help learners find the 
relevant features. If vast quantities of information is thrown at learners, there is a considerable risk that DDL 
activities can become time-consuming and frustrating for learners.” Furthermore, concordance-based exercises 
“are by no means a replacement for, but could be viewed as complementary to, the traditional, continuous cloze 
passages, learning of vocabulary through semantic fields, analysis of common roots etc.” (Packard, 1994: 
221-222). Despite these words of caution, it remains an important task for corpus linguistics researchers to
design more varied types of DDL materials “that address particular language items (especially items which
cause constant problems for learners) and that could be used directly in the EFL classroom” (Romer, 2009: 91).